To the Line Producer of Company X,
Is there a model that can help predict the future box office success of a film? How can production companies better predict which will be box office hits and which will be box office flops? Take for example, Andrew Stanton, coming off of the success of Finding Nemo (2004) and Wall E (2008) was lined up to direct another box office hit, John Carter (2012) catered towards a similar audience but did not receive the same box office warmth as his former two outings.
The following report aims to show what seem to be relevant factors that affect box office sales of a particular movie. We will be using a variety of modeling techniques (Random Forest, Decision Tree, Linear Regression) to aid us in predicting how much a a film will make given the its budget, run time, genre, crew size, year released, and week day released among a host of other variables to consider.
A word of caution before we proceed with the development, fine tuning, and interpretation of our models. The models we are about to create will only be as good as the data that is fed into them. Throughout the Exploratory Data Analysis, I will be sure to highlight where the data used to create the model may not be reflective of the true distribution of films released worldwide which made lead our model to overweight particular films.
It is my hope that this report will give you a better understanding of some underlying economic forces that determine a film’s box office and help you plan accordingly for your film’s budget. Some limitations of this report include the fact that it does not participate in two deeper layers of analysis:
1.) The following models, in the author’s opinion lack the nuance in identifying the idea that the creative process and artistic merit a film doesn’t always equate to a financial success. This model lacked a variety of metrics for success, mainly using revenue and popularity as a response variables and explanatory variables, respectively. If this analysis and modeling were to be redone with more extensive data, I would aim to include other metrics of success such as rotten tomato scores, and number of award nominations.
2.) A further break down of demographics in the revenue a film generates. The data on film revenue is an aggregate across several demographics and is impossible to determine a complete breakdown by demographics. This layer of analysis would be pertinent to a line producer’s job to determine the viability a film project given your specific demographics. For example, a line producer drafting a budget for an independent film in urban New York City will most likely have a different demographic to keep in mind than a blockbuster film in a rural part of Mexico.. This starts to enter the realm of marketing and segmenting for specific customers.
In other words, the author of this report acknowledges that what the model proposes to be a good formula for a box office hit may not lead to perfectly ethical casting or creative decisions.
But enough with the the opening credits, as they say in the industry: “Lights, camera, ….action!”
First, let’s load the libraries we will be using:
library(tidyverse)
library(tidymodels)
library(lubridate)
library(robotstxt)
library(data.table)
library(gridExtra)
library(glue)
library(rpart.plot)
library(reactable)
library(naniar)
library(visdat)
library(wesanderson)
library(pander)
The loading and cleaning of the data largely mirrors the process outlined by Saba Tavoosi
#Loading the Data
films_train <- read.csv("data/train.csv", na.strings=c("", '#N/A', '[]', '0'))
films_test <- read.csv("data/test.csv", na.strings=c("", '#N/A', '[]', '0'))
# Cleaning function
clean_data <- function(dataset){
data <- dataset %>%
select(-`poster_path`, - `tagline`, - `overview`, `homepage`) %>% #removing variables that won't be considered for further analysis
mutate(release_date = parse_date_time2(release_date, "mdy", cutoff_2000 = 20), #ensuring years after 2000 are correctly identified in the right century.
release_year = lubridate::year(ymd(release_date)), #grab year number from release_date
release_month = lubridate::month(ymd(release_date)), #grab month name from release_date
release_month_abbr = factor(month.abb[release_month],levels=month.abb),
release_quarter = lubridate::quarter(ymd(release_date)), #grab quarter number from release_date
release_week = lubridate::week(ymd(release_date)), #grab week number from release_date
release_wday = lubridate::wday(ymd(release_date)), #grab weekday number from release_date
release_wday_name = weekdays.Date(release_date, abbreviate = TRUE), #grab weekday abbreviated name from release_date
original_language = as.factor(original_language),
is_english = case_when(original_language == "en" ~ "English",
original_language != "en" ~ "Non English"), #new variable of two levels identifying whether it is an english or non-english speaking film.
genres = as.factor(genres),
main_genre = str_extract(genres, "Comedy|Horror|Action|Drama|Documentary|Science Fiction|
Crime|Fantasy|Thriller|Animation|Adventure|Mystery|War|Romance|Music|
Family|Western|History|TV Movie|Foreign"), #identifies the main genre of the movie according choosing from the given list
status = as.factor(status),
series = str_extract(belongs_to_collection, "(?<=name\\'\\:\\s{1}\\').+(?=\\'\\,\\s{1}\\'poster)"),
series = ifelse(!is.na(series), as.character(series), "No collection"), # filling NA values with "No Collection"
production_companies = gsub('(^\\[\\{\'name\'\\:\\s\'|\'\\,\\s\'id.*)', '',
production_companies),
production_countries = str_extract(production_countries, "[:upper:]+"),
top_prod_comp = case_when(production_companies == 'Universal Pictures' ~ 'Universal Pictures',
production_companies == 'Paramount Pictures' ~ 'Paramount Pictures',
production_companies == 'Twentieth Century Fox Film Corporation' ~ 'Twentieth Century Fox Film Corporation',
production_companies == 'Columbia Pictures' ~ 'Columbia Pictures',
production_companies == 'New Line Cinema' ~ 'New Line Cinema',
production_companies == 'Warner Bros.' ~ 'Warner Bros.',
production_companies == 'Walt Disney Pictures' ~ 'Walt Disney Pictures'),
top_prod_comp = ifelse(!is.na(top_prod_comp), as.character(top_prod_comp), "Other"), #filling NA values with "Other"
is_top_prod_comp = ifelse(top_prod_comp == "Other", "No", "Yes"),
part_of_franchise = ifelse(!is.na(series), "Yes", "No"),
all_cast_size = str_count(cast, "name"),
female_cast_size = str_count(cast, ('gender\'\\:\\s1')),
male_cast_size = str_count(cast, ('gender\'\\:\\s2')),
all_crew_size = str_count(crew, 'name'),
female_crew_size = str_count(crew, ('gender\'\\:\\s1')),
male_crew_size = str_count(crew, ('gender\'\\:\\s2')),
genre_count = str_count(genres, 'name'),
filtered_na = case_when(is.na(budget) ~ "excluded",
is.na(runtime) ~"excluded",
is.na(all_cast_size) ~ "excluded",
is.na(female_cast_size) ~ "excluded",
is.na(genre_count) ~ "excluded",
is.na(production_companies) ~ "excluded"),
filtered_na = ifelse(!is.na(filtered_na), as.character(filtered_na), "included"),
budget = budget/1000000
)
first_in_series <- data %>%
filter(!is.na(series)) %>%
arrange(series,
release_date) %>%
group_by(series) %>%
slice(1) %>%
mutate(order = "before") %>%
select(imdb_id,
order)
data <- data %>%
left_join(first_in_series, by = c("imdb_id" = "imdb_id")) %>%
rename(series = series.x) %>%
mutate(order = ifelse(is.na(order), "after", as.character(order))) %>%
select(-c(`id`, `belongs_to_collection`, `homepage`, `imdb_id`, `status`, `title`, `Keywords`, series.y, )) #removing columns impertinent to future modeling.
}
#Cleaning the Data
films_train <- clean_data(films_train) %>%
mutate(revenue = revenue/1000000)
films_test <- clean_data(films_test)
films_all <- bind_rows(films_train, films_test)
A note about the titles of the data set: The original “test” data set had NA values for revenue as that column was left for modeling to fill in values. However, in order to be able to assess the accuracy of the training on the testing data set (hence, supervised learning), I needed a testing data set that had the “right answer” provided in order to see how far off my models were (Using metrics such as Mean Absolute Error and Mean Absolute Percentage Error). Hence, I decided to split the training data set into a training data set and testing data set.
Saba Tavoosi did well in extracting information from JSON data but I would recommend on providing more text analysis explaining her reasoning behind the regular expressions used. Some were convoluted and the rest her works depends on a level of trust that regular expressions are extracting the information she wanted.
Saba Tavoosi painstakingly and with accuracy populated her data set
with extra research and manually entering missing data for budget. This
communicates a high priority in creating a complete training set rather
than filtering out NA values for budget. Tavoosi could
alternatively join data sets together on the uniuqe key variable
imdb_id using API’s to accumulate more data with their
corresponding budget and box office information. A pro of Tavoosi’s
method is that it allows her to research individual films and populate
as many of her own variables (all_cast_size,
runtime) as she can . A con of Tavoosi’s method is the time
inefficiency in doing this process for several more cases.
Another technical decision by Saba Tavoosi was her inclusion of plots using only numeric or integer variables. I intend to do some further data wrangling to obtain the week day names and month names to include more descriptive labels for plots. Part the data wrangling included mostly numbers for week days rather than the week day name itself leading to some confusion what day number 4 means.
Saba Tavoosi focused more on exploratory data analysis (EDA) which
was thoroughly done well, much of which will be replicated in this
report. Tavoosi included only one model near the end on supervised data
but did not provide error metrics for the testing data set. In this
report, I hope to expand on this by splitting the training data set
further in order to be able to see how the model works on untested
dataset. The last step Tavoosi had in her report was to populate the
testing data set with predictions but without the actual revenue for
each film in the testing data set, it is impossible to determine the
accuracy of the model on untested data. I hope to extend from Tavoosi’s
modeling work by provding three or four more models such as a decision
tree and a linear regression. I will also explore taking out variables
such as popularity.
As mentioned before, a lot of trust goes behind the regular expressions to extract any meaningful analysis when using the variables with json data. Tavoosi could have done more to gain trust in her regular expressions by explainig the process and what information she was hoping to extract. The extent of her work includes regular expressions but quickly moves on to plotting without confirmation that the numbers she obtained were indeed, the ones she was expecting.
The data cleansing above consisted of removing three columns that
were not of immediate pertinence to this discussion. I ensured the
release date would be read as a date with the correct corresponding
years. I also created columns to describe what quarter of the year a
film was released, used regular expressions to remove gratuitous text in
the belongs_to_collection and
production_companies variables, and labeled which films
were released by top production companies which include Universal
Pictures, Paramount Pictures, Twentieth Century Fox Film Corporation,
Columbia Pictures, New Line Cinema, Warner Brothers, and Walt Disney
Pictures. I loaded the test data set and training data set separately as
that is the form in which I found them and joined them together to
create the data set films_all. Note, the test data set as
expected did not have a revenue column as that is the
response variable in question so films_all will have as
many cases where revenue is not available as there are number of cases
in the films_test.
The following plot should illustrate this fact:
films_all %>%
select(revenue) %>%
vis_miss()
It should be noted that the variables
belongs_to_collection, genres,
production_company, spoken_languages,
Keywords, cast, and crew were
imported as json data. Regular expressions, copied from Saba
Tavoosi have been used to extract information from each column.
I have decided to remove gratuitous variables that the author of this
report has deemed to have little use for future modeling. For example,
imdb_id is an identifier column and gives little
information about the content of the film. What the ID number may
communicate is the age of the film (lower ID numbers may be have been
released earlier as more recent films receive higher ID numbers) but the
age of the film can be better determined by its
release_year.
I have engineered the following new variables using information from the given columns from the Kaggle data set:
is_top_prod_comp : describes if a film was produced by
any of the top six production companies (Universal Pictures, Paramount
Pictures, Twentieth Century Fox, Columbia Pictures, New Line Cinema,
Warner Brothers, and Walt Disney Pictures.) labelled as “Yes”. If not,
is is labeled as “No”. Note the distinction between Twentieth Century
Fox and Walt Disney Pictures is permissible as this data set contains
films released before Disney’s acquisition of Twentieth Century Fox.
part_of_franchise : describes if a film is part of a
collection, franchise, or has a shared characters existing in other
films, labelled as “Yes”.If not, labeled as “No”.
order: describes if a film within this data set has
another film represented that comes after it in the same collection. The
variable orderhas two levels: “before” and “after”. Note,
since this data set is not a complete list of films and will frequently
omit films that started franchises, I have opted to use the vocabulary
“before” and “after” to identity within this data set which films have
come before other films in the same franchise. The words before and
after should not be used synchronously with original and sequel,
respectively, as it is quite possible to have a film in this data set be
considered the ‘first’ in its collection but still be a sequel. Also
note that due to the filtering join function performed to obtain a list
of films that are the “first” in their collection within this data set,
the data set will categorize all others as “after” is the only film in
the data set. The author and modeler of this report acknowledges the
imprecision in this data engineering and encourages the reader to
consider this variable to be of little importance in future models.
Before we filter out values in preparation for modeling, let’s examine which cases we will be removing and if they resemble a pattern. Ideally for the models to be unaffected by the absence of certain cases, we would like to see relative similarity between the density plots of revenue for cases that will be included and cases that we are planning to exclude.
vis_dat(films_train)
vis_miss(films_train)
gg_miss_var(films_train)
wes_colors4 <- wes_palette("GrandBudapest1", 4)
films_train %>%
ggplot() +
aes(x = revenue,
fill = filtered_na)+
geom_density() +
scale_x_log10() + #proportions larger revenue values to appear larger.
scale_fill_manual(values = wes_colors4)
It appears that we have the most missing values from our variable
budget with 812 missing values or NA values.
Our density plot clearly shows that removal of our NA values,
particularly those in budget requires us to alter our
interpretation of the model. The fact that the peak of the “excluded”
density falls lower in height (density) and in value (revenue) suggests
that many of the budgets that were left as NA in the original data set
had very low budgets. Our models then adjust to describe films that only
reside in the blue shaded area of films in the “included” portions where
the revenue is high. In other words, this model would be more
appropriately used for films expected to garner more at the box office
than independent films.
We then proceed to filter out the cases in the red shaded areas to be excluded:
films_train <- films_train %>%
filter(!is.na(budget), #filtering in preparation for modeling
!is.na(runtime), #filtering in preparation for modeling
!is.na(all_cast_size), #filtering in preparation for modeling
!is.na(female_cast_size), #filtering in preparation for modeling
!is.na(genre_count), #filtering in preparation for modeling
!is.na(production_companies)) #filtering in preparation for modeling
For the purposes of supervised modeling, I have decided to use only the testing data and split it further into training and testing. In this way, I will be able to evaluate the accuracy of my model in a supervised manner and calculate error metrics for my test data.
The top 5 cases for the training set are listed below:
set.seed(1)
film_split <- initial_split(films_train, prop = 9/10)
train <- training(film_split)
test <- testing(film_split)
# filter(!is.na(genre_count))
set.seed(616)
film_resamples <- vfold_cv(train, v = 5)
train %>%
head(5) %>%
select(original_title,
budget,
popularity,
production_companies,
release_date,
runtime,
revenue) %>%
reactable(filterable = TRUE,
searchable = TRUE,
minRows = 5)
The following code is taken and reworked from Saba Tavoosi
# Year released
year_plot <- films_train %>%
ggplot(aes(x = release_year,
y = revenue,
color = release_year)) +
geom_point() +
geom_smooth(method = 'lm', color = 'red3', fill = 'red3') +
scale_y_continuous(breaks = c(0, 500000000, 1000000000, 1500000000),
labels = c('$0', '$500', '$1000', '$1500')) +
theme_classic() +
theme(legend.position = 'none') +
labs(title = 'Revenue by year released', x = 'Release year', y = 'Revenue (Millions)')
# Quarter released
quarter_plot <- films_train %>%
ggplot(aes(x = factor(release_quarter),
y = revenue,
fill = factor(release_quarter))) +
stat_summary_bin(fun = median, geom = "bar") +
scale_y_continuous(breaks = c(0, 10000000, 20000000),
labels = c('$0', '$10', '$20')) +
theme_classic() +
theme(legend.position = 'none', axis.text.x = element_text(angle = 90)) +
scale_fill_manual(values = wes_colors4) +
labs(title = 'Revenue by quarter released',
x = 'Release quarter',
y = 'Median revenue (Millions)')
# Month released
month_plot <- films_train %>%
ggplot(aes(x = factor(release_month_abbr),
y = revenue,
fill = release_month)) +
stat_summary_bin(fun = median, geom = "bar") +
scale_y_continuous(breaks = c(0, 10000000, 20000000, 30000000),
labels = c('$0', '$10', '$20', '$30')) +
theme_light() +
theme(legend.position = 'none', axis.text.x = element_text(angle = 45)) +
labs(title='Median revenue by month released', x='Release month', y='Median revenue (Millions)')
# Week released
week_plot <- films_train %>%
ggplot(aes(x = factor(release_week),
y = revenue,
fill = factor(release_week))) +
stat_summary_bin(fun = median, geom = "bar") +
scale_y_continuous(breaks = c(0, 20000000, 40000000, 60000000),
labels = c('$0', '$20', '$40', '$60')) +
theme_light() +
theme(legend.position = 'none', axis.text.x = element_text(angle = 90)) +
labs(title='Revenue by week released', x='Release week', y='Median revenue (Millions)')
# Weekday released
weekday_plot <- films_train %>%
ggplot(aes(x = reorder(factor(release_wday_name), release_wday),
y = revenue,
fill = release_wday_name)) +
stat_summary_bin(fun = median, geom = "bar") +
scale_y_continuous(breaks = c(0, 10000000, 20000000, 30000000),
labels = c('$0', '$10', '$20', '$30')) +
theme_light() +
theme(legend.position = 'none', axis.text.x = element_text(angle = 45)) +
labs(title = 'Revenue by weekday released', x='Release day', y='Median revenue (Millions)')
# Create a grid of the plots.
grid.arrange(year_plot, quarter_plot, month_plot, weekday_plot, week_plot,
layout_matrix = rbind(c(1, 2, 3),
c(5, 5, 4)))
The following code is taken from Saba Tavoosi
columnchart_by_year <- function(dataset, color, subtitle){
dataset %>%
group_by(release_year) %>%
count() %>%
ggplot() +
aes(x = release_year,
y = n,
fill = release_year) +
geom_col(fill = color) +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Distribution of Released Films by Release Year",
subtitle = subtitle,
x = "Year",
y = "Number of Films",
caption = "Source: kaggle")
}
films_train %>%
group_by(release_year) %>%
count() %>%
arrange(desc(n)) %>%
head(5) %>%
pander()
| release_year | n |
|---|---|
| 2010 | 95 |
| 2013 | 93 |
| 2016 | 93 |
| 2011 | 92 |
| 2015 | 85 |
films_test %>%
group_by(release_year) %>%
count() %>%
arrange(desc(n)) %>%
head(5) %>%
pander()
| release_year | n |
|---|---|
| 2014 | 197 |
| 2013 | 194 |
| 2011 | 187 |
| 2015 | 184 |
| 2010 | 180 |
year_column_chart_train <- columnchart_by_year(train, "lightcyan3", "from training dataset'")
year_column_chart_test <- columnchart_by_year(test, "lightcyan4", "from testing dataset'")
grid.arrange(year_column_chart_train,
year_column_chart_test,
nrow = 2)
The most popular years for a film in our data set to be released were 2013, 2014, and 2015 with 335, 320, and 312 films respectively. Our data set only had 1 film coming from 1923, 1924, 1929, 2018, 2021. We have a skewed left distribution of films across the years with the earliest coming from 1923 and the most recent from 2018.
It appears that our testing and training data set both follow a skewed left distribution with a peak near the 2010 and 2011 years. A conspicuous difference that will potentially affect the difference in accuracy our model will predict on the testing and testing data sets will be the noticeable dip in the number of films released in 2009 within the testing data set.
I have reservations about how this training data set mirrors the distribution of films in the real world. It appears that this data set has an oversampling bias for the early 2000’s. For example, Allen J Scott has found a more even distribution of films than the spike depicted in our data set. This data set also does not feature many releases from 2021 and 2022, thereby creating a distribution that poorly resembles the findings of Statista For example, I find it concerning that this data set doesn’t contain films from the most recent years, especially as these films may contain information related to the pandemic’s tumultuous effect on the film industry. In other words, the absence of these films from more recent years may potentially underestimate the effect the pandemic has had on future films. If the data set does contain films in more recent years, I would intend to engineer a new variable that classifies film as “pre-pandemic”, “immediately post-pandemic films” (i.e. films released one year of March 10, 2020) and “late post-pandemic films” (i.e. films released after March 10, 2022.).
The following code is taken from Saba Tavoosi
# Budget
films_train %>%
ggplot(aes(x = budget,
y = revenue,
color = budget)) +
geom_point() +
geom_smooth(method = 'gam',
color = "black",
fill = 'gray') +
# scale_x_log10()
scale_y_continuous(breaks = c(0, 500000000, 1000000000, 1500000000),
labels = c('$0', '$500', '$1000', '$1500')) +
scale_x_continuous(breaks = c(0, 50000000, 100000000, 200000000, 300000000),
labels = c('$0', '$50', '$100', '$200', '$300')) +
theme_classic() +
theme(legend.position = 'none') +
labs(title = 'Revenue by budget', x = 'Budget (Millions)', y = 'Revenue (Millions)')
While not too many observations can be drawn from this preliminary graph, one can see how in general, a higher budget film may lead to higher revenue. This is of course a trend with many opportunities to show a high budget film with a low revenue and vice versa.
The following code is taken from Saba Tavoosi
# Movie counts by main genre
films_train %>%
ggplot() +
aes(x = fct_infreq(main_genre),
fill = main_genre) +
geom_bar() +
theme_bw() +
coord_flip() +
theme(legend.position = 'none') +
labs(title = 'Genre by count', x = 'Genre', y = 'count')
#Boxplot per Genre
films_train %>%
ggplot(aes(x = fct_infreq(main_genre),
y = revenue,
fill = main_genre)) +
stat_summary() +
coord_flip() +
scale_y_continuous(breaks = c(0, 50000000, 100000000, 200000000),
labels = c('$0', '$50', '$100', '$200')) +
coord_flip() +
theme_classic() +
theme(legend.position = 'none') +
labs(title = 'Median revenue by genre', x = 'Genre', y = 'Median revenue (Millions)')
The bar plot above shows that Drama, Comedy, and Action were the most
popular film releases in our data set. On the other end of the spectrum,
Foreign films, History, and Western were the last popular. It should be
noted that the classifications of genre may, out of simplification,
choose drama and comedy as they are typically the dominant genre to
describe a film. For example, a murder comedy mystery may be classified
as a comedy. A further analysis could break these genres into further
subsets and capture the nuances of more specified genres and how it
affect revenue.
The second plot shows a five number summary (Min, Q1, Median, Q3, Max) breakdown by Genre. We can observe that Science Fiction has a wide distribution whereas as Adventure, with a smaller distance between their minimum and maximum values, generally performs the best out of all the other genres. On the other end, we see that Documentary films and foreign films on average, tend to have the lowest box office sales when compared to other genres.
The following code is taken from Saba Tavoosi
#Place this in the clean dataset
films_train %>%
filter(!is.na(top_prod_comp)) %>%
ggplot() +
aes(x = top_prod_comp,
y = revenue,
fill = top_prod_comp) +
geom_boxplot() +
scale_y_continuous(breaks = c(0, 500000000, 1000000000, 1500000000),
labels = c('$0M', '$500', '$1000', '$1500')) +
coord_flip() +
theme_classic() +
theme(legend.position = 'none') +
labs(title = 'Revenue by top production companies',
x = 'Top production companies', y = 'Revenue (Millions)')
When broken down by Production Companies, we see that Walt Disney seems to outperform its competitors with the highest median and Q3. While Universal Pictures and Paramount Pictures may have a smattering of films that have higher box office sales than Walt Disney highest grossing film, the body of these two company s’ inner quartile range is solidly lower than Disney median.
However, the author of this report suggests the reader to apply the results of this boxplot only to the data set available. For example, it is evident that Walt Disney has produced a number of top grossing films but are not represented in some of the largest outliers. For example, from this plot, I was able to confirm that this box plot does not include Avatar or Avengers:Endgame the highest and second highest grossing films of all time, respectively.
The following code is taken from Saba Tavoosi
films_train %>%
group_by(release_year) %>%
summarize(average_boxoffice = mean(revenue)) %>%
ggplot() +
aes(x = release_year,
y = average_boxoffice) +
geom_col() +
theme_minimal() +
theme(legend.position = "none") +
labs(title = "Average Box Office Sales ",
subtitle = "by Release Year",
x = "Year",
y = "Average Box Office Sales",
caption = "Source: kaggle, trained data")
As predicted, with a combination of more volume increase and inflation, the overall average box office sales per year has increased since 1925 according to our data. The revenue values do not account for inflation.
# Movie Count by Sequel Stats
films_train %>%
ggplot() +
aes(x = fct_infreq(order),
fill = order) +
geom_bar() +
theme_bw() +
coord_flip() +
theme(legend.position = 'none') +
labs(title = 'Order by count', x = 'Order', y = 'Count') +
scale_fill_manual(values = wes_colors4)
# Revenue against Order of film in franchise
films_train %>%
ggplot(aes(x = order,
y = revenue,
fill = order)) +
geom_boxplot() +
scale_fill_manual(values = wes_colors4) +
theme(legend.position = "none")
Our results indicate that our dataset contained more films classified as “after” but films classified as “before” overall performed better at the box office as depicted in the boxplot.
See above after the filtering of the data set to see the training and testing split.
I aim to use the variable revenue as the exploratory
variable with budget, popularity,
production_companies, production_countries,
release_date, runtime,
release_year, release_month,
release_quarter, release_week,
release_wday, all_cast_size,
female_cast_size, male_cast_size, and
genre_count as explanatory variables.
Both of these two models will use the same formula.
film_formula1 <- revenue ~ budget + popularity + production_companies + release_date + runtime + release_year + release_month + release_quarter + release_week + release_wday + is_english + is_top_prod_comp
formula1 <- revenue ~ budget + popularity + release_date + runtime + release_year + release_month + release_quarter + release_week + release_wday + all_cast_size + female_cast_size + male_cast_size + genre_count
linreg_recipe1 <- recipe(
film_formula1,
data = train) %>%
step_other(production_companies)
linreg_workflow3 <- workflow(
preprocessor = linreg_recipe1,
spec = linear_reg()
)
#Using Cross Validation
model1_samples <- fit_resamples(linreg_workflow3,
resamples = film_resamples,
metrics = metric_set(mae))
model1_samples %>%
collect_metrics(summarize = TRUE)
## # A tibble: 1 × 6
## .metric .estimator mean n std_err .config
## <chr> <chr> <dbl> <int> <dbl> <chr>
## 1 mae standard 56.6 5 2.04 Preprocessor1_Model1
I proceed to create three new models: A linear regression, decision
tree, and a second random forest model. These three models will use the
same formula with the same explanatory variables. However, this formula
includes all_cast_size,female_cast_size,
male_cast_size, and genre_count while
eliminating production_companies.
formula1 <- revenue ~ budget + popularity + release_date + runtime + release_year + release_month + release_quarter + release_week + release_wday + all_cast_size + female_cast_size + male_cast_size + genre_count
#creating a function to fit a model given a the model type.
fit_model <- function(type){
model_formula <- formula1
spec <- type(mode = "regression")
name <- fit(
spec,
model_formula,
data = train)
name
}
#creating a function to plot a model on a budget v. revenue scatterplot
model_scatter_plot <- function(model){
augment(model, train) %>%
ggplot(aes(x = budget, y = .pred)) +
geom_line() +
geom_point(aes(y = revenue), alpha = .05) +
scale_x_log10()
}
#Fitting Model
linear_model1 <- fit_model(linear_reg)
#Plotting Model
model_scatter_plot(linear_model1)
dectree_recipe <- recipe(formula1,
data = train)
dectree_workflow <- workflow(
preprocessor = dectree_recipe,
spec = decision_tree(mode = "regression",
tree_depth = 5,
cost_complexity = .000001,
min_n = 2)
)
#Fitting Model
dt_model <- fit(dectree_workflow, data = train)
#Plotting Model
model_scatter_plot(dt_model)
#Decision Tree Plot
dt_model %>%
extract_fit_engine() %>%
rpart.plot(roundint = FALSE, digits = 3, type = 5)
For context, Saba
Tavoosi created a random forest model with an accompanying
importance plot that communicated that budget,
runtime, and popularity were the three most
important variables in her model. This importance plot was derived from
a correlation plot of all of her variables.
Our Decision Tree plot confirms the importance of
budget, runtime and the
popularity in predicting a film’s revenue. We
find that our decision tree model first split on budget, with large
majority of films having been produced with a budget under 103 million
dollars. We notice an overall pattern where films on the right side of
the decision tree plot that have been categorized as having a budget
over 103 million dollars have predictor values in their terminal nodes
that are higher than the terminal node values on the left hand side. For
example, our particular decision tree model, dt_model,
would predict that a film that has a budget of 98 million dollars, with
a popularity score of 12, would make around 250 million dollars along
with 2.1% of the films in the training data set. A producer looking to
be in the exclusive club of passing the 1 billion dollar mark would pay
close attention to routes to the terminal nodes with values greater than
1,000 (in millions). In this case, a our model would predict any film
that had a budget over 290 million dollars, had a popularity score above
or equal to 26.7, and had less than 3.5 genres. Overall, I believe this
decision tree is making logical conclusions about the data set.
formula1 <- revenue ~ budget + popularity + release_date + runtime + release_year + release_month + release_quarter + release_week + release_wday + all_cast_size + female_cast_size + male_cast_size + genre_count
randforest_recipe <- recipe(
formula1,
data = train) %>%
step_normalize(all_numeric_predictors())
randforest_workflow <- workflow(
preprocessor = randforest_recipe,
spec = rand_forest(mode = "regression")
)
#Fitting Model
rf_model <- fit(randforest_workflow, data = train)
#Plotting Model
model_scatter_plot(linear_model1)
Now evaluating the accuracy on the test dataset.
# MAE Error Metric Function given one of the fitted models
mae_error_metrics <- function(model){
augment(model, test) %>%
mae(truth = revenue, estimate = .pred) %>%
pander()
}
# MAPE Error Metric Function given one of the fitted models
mape_error_metrics <- function(model){
augment(model, test) %>%
mape(truth = revenue, estimate = .pred) %>%
pander()
}
#Error Metrics
mae_error_metrics(linear_model1)
| .metric | .estimator | .estimate |
|---|---|---|
| mae | standard | 59.3 |
mape_error_metrics(linear_model1)
| .metric | .estimator | .estimate |
|---|---|---|
| mape | standard | 102109 |
mae_error_metrics(dt_model)
| .metric | .estimator | .estimate |
|---|---|---|
| mae | standard | 57.45 |
mape_error_metrics(dt_model)
| .metric | .estimator | .estimate |
|---|---|---|
| mape | standard | 22013 |
mae_error_metrics(rf_model)
| .metric | .estimator | .estimate |
|---|---|---|
| mae | standard | 50.89 |
mape_error_metrics(rf_model)
| .metric | .estimator | .estimate |
|---|---|---|
| mape | standard | 13277 |
pred_vs_obs <- function(model, subtitle){
augment(model, test) %>%
ggplot(aes(x = revenue,
y = .pred,
color = is_top_prod_comp)) +
geom_point(alpha = .85) +
coord_obs_pred() +
geom_abline() +
labs(title = "Predicted v. Observed Scatterplot",
subtitle = subtitle,
x = "Observed",
y = "Predicted",
color = "Top Production Company") +
scale_color_manual(values = wes_colors4)
} #function to create a predicted v. observed plot
pred_vs_obs(linear_model1, "linear regression")
pred_vs_obs(dt_model, "decision tree")
pred_vs_obs(rf_model, "random forest")
glue("We have obatined a mean aboslute error of {mae_error_metrics(linear_model1)[3]}, {mae_error_metrics(dt_model)[3]}, and {mae_error_metrics(rf_model)[3]} for our linear regression, decission tree regression, and random forest regreesion models, respectively.")
## We have obatined a mean aboslute error of NA, NA, and NA for our linear regression, decission tree regression, and random forest regreesion models, respectively.
glue("We have obatined a mean aboslute percentage error of {mape_error_metrics(linear_model1)[3]}, {mape_error_metrics(dt_model)[3]}, and {mape_error_metrics(rf_model)[3]} for our linear regression, decission tree regression, and random forest regreesion models, respectively.")
## We have obatined a mean aboslute percentage error of NA, NA, and NA for our linear regression, decission tree regression, and random forest regreesion models, respectively.
Our findings show that our ‘best’ model with the lowest range of error (MAE) was the second random forest model, though even this model could be improved.
In our Predicted v. Observed plots, we can confirm that our second random forest model performed the best as many of the data points fell close to the diagonal black line. A perfect fit would have all data points lined up on the diagonal black line signifying that the predicted value is the same as the observed value. This was not apparent in the linear regression model and especially not in the decision tree model. In both the linear regression model and our second random forest model, we noticed that the model would start to underestimate the film’s revenue with larger budgets.
Given the instances in the discrepancy between the films represented in the data set used to create our four models and the total number of films, I would advise the reader to limit their applications of these models on other films.
The results of the models, particularly the random forest and
decision tree models indicate that a larger budget, in general, results
in a larger box office return (revenue). Other important
factors to consider are the popularity of films. This, however, is
unhelpful to a line producer as this data point is recorded after the
release of a film where the budget has already been finalized. One way
to work with this data is to examine more closely the content of your
upcoming film and how it may compare to your previous popular films
within your . It is apparent and confirmed in these particular models
that people are more willing to see more popular films so it is
advisable to produce a film that mirrors successful models. However,
this model fails to capture the value of originality which affects the
popularity of a film (i.e. Everything
Everywhere All At Once, an innovative film and largest grossing
film for A24).
Another large factor to consider is that the variable
popularity is retrospective and essentially allows the
model to cheat by using data after a film has been released. The
popularity variable, similar to the our response
revenue variable is only known after the film’s release or
after both critics and audiences have had a chance to see the film. In
other words, in actual application, it would be impossible to include
popularityin our model as it would be determined after the
response variable itself. Potentially, one could use the rotten tomato
score or the early IMDB rating from early viewers as a substitute
variable to use in place of popularity. However, this risks
including misleading information from review bombing where a film can
still be successful despite a number of passionate film reviewers giving
negative reviews early on before the film has even been released. What
happens to our models if we remove the variable popularity
from our model formulas? Let’s examine the best model we had, the random
forest model. We use the same preprocessing and error metrics but only
alter the formula to not include popularity
popularityformula2 <- revenue ~ budget + release_date + runtime + release_year + release_month + release_quarter + release_week + release_wday + all_cast_size + female_cast_size + male_cast_size + genre_count #removed popularity
randforest_recipe <- recipe(
formula2,
data = train) %>%
step_normalize(all_numeric_predictors())
randforest_workflow <- workflow(
preprocessor = randforest_recipe,
spec = rand_forest(mode = "regression")
)
#Fitting Model
rf_model2<- fit(randforest_workflow, data = train)
#Error Metrics
mae_error_metrics(rf_model)
| .metric | .estimator | .estimate |
|---|---|---|
| mae | standard | 50.89 |
mae_error_metrics(rf_model2)
| .metric | .estimator | .estimate |
|---|---|---|
| mae | standard | 54.71 |
As predicted, our model has worsened in its accuracy when we take out
one of its key predictor variables. Intuitively, popularity
could potentially have a correlation with revenue.
While I have been critical of the representation of the data, I commend the data set for having including detailed information about each film that would be able to add nuance to the model. In other words, while the data set may be lacking in quality of the rows, it compensates with the quality of the columns. However, there still exists a multitude of other factors that can be strong determinants a box office success.
As iterated in the executive summary, I hold reservations from solely using this model in your exploration. Consider other models that are able to process film data about differentiating creative inputs (i.e. Directors, use of color palette, cinematographers, level of profanity/nudity/thematic content) that will affect the quality of the film and its likelihood to be well received by your specific demographics for whom the film was marketed. Consider also the ethical implications that this model fails to address. Consider the legal ramifications if the casting decides show gender bias based on the evidence of this model.
Continually seek better data that can generate more insights into the film industry and what are the key drivers that drive up the willingness to pay. The author of this report encourages seeking a balance between creative integrity of film-making and the business end of the industry.
Saba Tavoosi: https://www.kaggle.com/code/tavoosi/predicting-box-office-revenue-with-random-forest